1 Introduction



Data clustering has applications in a wide variety of fields, ranging from social and biological
science to business and information retrieval. Clustering
refers to the process of grouping data based only on information found in the data which describes
its characteristics and relationships. Although humans are generally very good at discovering
patterns and classifying objects, clustering algorithms are able to discern similarities in data even
when humans are not [5]. The main focus of our research has been document clustering, but we
will demonstrate that our methods also work nicely on scientific data.

In this paper, we propose an adaptation of the clustering algorithm known as Principal
Direction Divisive Partitioning (PDDP), developed by Daniel Boley in [2], which is based on Principal
Components Analysis (PCA). PCA involves the eigenvector decomposition of a data covariance
matrix, or equivalently a singular value decomposition (SVD) of a data matrix after mean centering.
The name of our adaptation, Principal Direction Gap Partitioning (PDGP), borrows most of
its name from PDDP, as it follows many of the same steps that PDDP follows. The word "gap"
replaces the word "divisive" in reference to how the algorithm splits data along natural gaps at
each step. This concept will be further developed in the following sections, but it should be noted
that PDGP is still a divisive algorithm in the same way that PDDP is.



2 Mathematical Notation and Background 

In order to fully understand how and why PDDP works, we will begin with a detailed description 
of the linear algebra and geometry which support the algorithm. 



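Definition 1 (The Singular Value Decomposition (SVD)). For each $C \in \mathbb{R}^{m \times n}$ of rank $r$, there are
orthogonal matrices

\[ U_{m \times m} = [u_1 | u_2 | \cdots | u_m] \quad\text{and}\quad V_{n \times n} = [v_1 | v_2 | \cdots | v_n] \]

and a diagonal matrix $D_{r \times r} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ such that

\[ C = U \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T \quad\text{with}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0. \]

The $\sigma_i$'s are the nonzero singular values of C, and the respective columns $u_i$ and $v_i$ are the
left-hand and right-hand singular vectors for C.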
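As a small numerical illustration of Definition 1, the following sketch uses NumPy's `numpy.linalg.svd` on an arbitrary random matrix (the matrix and its size are assumptions chosen only for the example) and rebuilds C from its rank-one terms.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 3))          # arbitrary 5 x 3 example matrix

# full_matrices=True returns the square orthogonal factors U (m x m) and V^T (n x n).
U, sigma, Vt = np.linalg.svd(C, full_matrices=True)
r = np.linalg.matrix_rank(C)

# Rebuild C from the rank-one terms sigma_i * u_i * v_i^T, i = 1, ..., r.
C_sum = sum(sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))

print(np.allclose(C, C_sum))             # the sum of rank-one terms reproduces C
print(sigma)                             # singular values, in non-increasing order
```

NumPy returns the singular values already sorted in non-increasing order, which matches the ordering convention in the definition.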

2.1 Directions and Lines of Principal Trend 

The principal trend in data can be considered in two ways. In principal component analysis (PCA),
the direction of principal trend is considered the direction in which the variance (or spread) of the
data is maximal [3]. Another way to define the principal trend is by means of least squares, in
which case the trend is along a line L for which the total sum of squares of orthogonal deviations
from L is minimal among all lines in $\mathbb{R}^m$. The concepts of maximal spread and minimal deviations
are equivalent in this context. For the sake of subsequent developments, we present the details of
this fact below.

For a matrix $A_{m \times n} = [a_1 | a_2 | \cdots | a_n]$ of column data, we define the mean and variance,
respectively, as follows:

\[ \mu = \frac{Ae}{n} \quad\text{and}\quad \mathrm{Var}[A] = \frac{\|A - \mu e^T\|_F^2}{n}, \]

where e is a vector of all ones and $\|\cdot\|_F$ is the Frobenius matrix norm. We will refer to a
centered matrix, $C = A - \mu e^T$, whose mean is zero and variance is $\|C\|_F^2 / n$. A trend line
$L(x,p) = \{\alpha x + p \mid \alpha \in \mathbb{R}\}$ for a data cloud in $\mathbb{R}^m$ is defined by a direction vector
$x \in \mathbb{R}^m$ with $\|x\|_2 = 1$ and a point $p \in \mathbb{R}^m$. See Figure 1.



2.1.1 Minimum Deviation Trend Line 

The minimum deviation trend line is the line L for which the total sum of squares of orthogonal
deviations between the data and L is minimal among all lines in $\mathbb{R}^m$. To determine L, let $\hat{a}_j$ denote
the orthogonal projection of $a_j$ onto a line $L(x,p)$. This orthogonal projection is given by

\[ \hat{a}_j = xx^T(a_j - p) + p \]

and thus the difference between $a_j$ and the closest point on $L(x,p)$ is $a_j - \hat{a}_j = (I - xx^T)(a_j - p)$.
Consequently, the minimum deviation trend line is located by finding $x, p \in \mathbb{R}^m$ that solve the
minimization problem

\[ \min_{x,\,p,\ \|x\|_2 = 1} f(x,p) \]

where the objective function is

\[ f(x,p) = \sum_{j=1}^{n} \|a_j - \hat{a}_j\|_2^2 = \|(I - xx^T)(A - pe^T)\|_F^2. \]

Figure 1: Trend Line. The trend line $L(x,p) = \{\alpha x + p\}$ passes through the point p with direction x.



The following theorem precisely characterizes the minimum deviation trend line. 

Theorem 1 (Minimum Deviation Trend Line). The minimum deviation trend line for the column
data in A is given by

\[ L = \{\alpha u_1(C) + \mu \mid \alpha \in \mathbb{R}\} \]

where $u_1(C)$ is the principal left-hand singular vector of the centered matrix

\[ C = A - \mu e^T = A(I - ee^T/n). \]

Proof. Apply straightforward differentiation to the function $f(x,p)$, and begin by looking for points
p that satisfy

\[ 0 = \frac{\partial f}{\partial p} = (\partial f/\partial p_1, \ldots, \partial f/\partial p_m)^T. \]

Letting $Q = I - xx^T$ and using $Q^2 = Q = Q^T$ (since $\|x\|_2 = 1$) yields

\[ f(x,p) = \mathrm{trace}([A^T - ep^T]\,Q\,[A - pe^T]) \]
\[ \phantom{f(x,p)} = \mathrm{trace}(A^T Q A) - 2\,\mathrm{trace}(A^T Q p e^T) + \mathrm{trace}(e p^T Q p e^T) \]
\[ \phantom{f(x,p)} = \mathrm{trace}(A^T Q A) - 2n\,p^T Q \mu + n\,p^T Q p \]

so that

\[ \frac{\partial f}{\partial p} = -2nQ\mu + 2nQp; \]

consequently,

\[ Q(p - \mu) = 0 \]

and thus $p = \alpha x + \mu$ where $\alpha = x^T(p - \mu)$. In other words, regardless of what x turns out to
be, a minimizing point p necessarily lies on the line $L(x,\mu)$. Thus, to find the direction vector x
which minimizes $f(x,\mu)$, observe that

\[ f(x,\mu) = \|(I - xx^T)(A - \mu e^T)\|_F^2 = \|(I - xx^T)C\|_F^2 = \|C\|_F^2 - \|C^T x\|_2^2 \]

so the minimum of $f(x,\mu)$ is attained precisely at points where $\max_{\|x\|_2 = 1} \|C^T x\|_2^2$ is attained. It is
well known that

\[ \max_{\|x\|_2 = 1} \|C^T x\|_2^2 = \|C^T\|_2^2 = \sigma_1^2(C) \]

occurs at $x = u_1(C)$, and thus the minimum deviation (or total least squares) trend line is

\[ L = \{\alpha u_1(C) + \mu \mid \alpha \in \mathbb{R}\}. \]

□
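The last identity used in the proof, which rewrites the Frobenius norm of the projected matrix as a difference, is stated without justification. A short derivation, using only the properties $Q^2 = Q = Q^T$ of $Q = I - xx^T$ already noted in the proof, can be written as follows.

```latex
\begin{align*}
\|(I - xx^T)C\|_F^2
  &= \mathrm{trace}\left(C^T Q^T Q\, C\right)
   = \mathrm{trace}\left(C^T Q\, C\right) \\
  &= \mathrm{trace}\left(C^T C\right) - \mathrm{trace}\left(C^T x x^T C\right) \\
  &= \|C\|_F^2 - \|C^T x\|_2^2 .
\end{align*}
```

The final step uses $\mathrm{trace}(C^T x x^T C) = (C^T x)^T (C^T x) = \|C^T x\|_2^2$, since the trace of an outer product $ww^T$ equals $w^T w$.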

2.1.2 Maximum Variance Trend Line 

Another natural way to gauge the principal trend of the data is to locate the line $L \subset \mathbb{R}^m$ along
which the data is most spread, i.e., the line along which the variance is maximal. Since the
orthogonal projection of $a_j$ onto any line $L(x,p)$ is $\hat{a}_j = xx^T(a_j - p) + p$, the directional variance along
$L(x,p)$ is

\[ \mathrm{Var}[\hat{A}] = \frac{\|\hat{A} - \mu_{\hat{A}} e^T\|_F^2}{n} \quad\text{where}\quad \hat{A} = (I - xx^T)pe^T + xx^T A. \]

Since $\mu_{\hat{A}} = \hat{A}e/n = (I - xx^T)p + xx^T\mu$, it follows that

\[ \hat{A} - \mu_{\hat{A}} e^T = xx^T(A - \mu e^T) = xx^T C \]

and thus,

\[ \mathrm{Var}[\hat{A}] = \frac{\|xx^T C\|_F^2}{n} = \frac{\|C^T x\|_2^2}{n}. \]

So by the same reasoning as above, the direction vector x that maximizes the directional variance
$\mathrm{Var}[\hat{A}]$ is also $u_1(C)$.

Definition 2 (The Principal Trend Line). The principal trend line for the column data in A
is defined to be

\[ L = \{\alpha u_1 + \mu \mid \alpha \in \mathbb{R}\} \]

and it represents both the line of minimal total deviations as well as the line of maximal variance.
Note: Unless otherwise stated, it is hereafter understood that $u_1 = u_1(C)$ is the principal left-hand
singular vector of the centered matrix $C = A - \mu e^T = A(I - ee^T/n)$.
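The equivalence of the two characterizations can be checked numerically. The sketch below is only an illustration on synthetic data; the function names `principal_trend_line`, `deviation` and `directional_variance` are hypothetical helpers introduced here, not part of the paper. It computes $\mu$ and $u_1$ from the SVD of the centered matrix and compares $u_1$ against random unit directions.

```python
import numpy as np

def principal_trend_line(A):
    """Return (mu, u1, C) for column data A of shape (m, n)."""
    mu = A.mean(axis=1, keepdims=True)        # mean of the columns, shape (m, 1)
    C = A - mu                                # centered matrix C = A - mu e^T
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    return mu, U[:, 0], C

def deviation(C, x):
    """Sum of squared orthogonal deviations from the line through mu with direction x."""
    x = x / np.linalg.norm(x)
    return np.linalg.norm(C - np.outer(x, x @ C), "fro") ** 2

def directional_variance(C, x):
    """Variance of the data projected onto the direction x: ||C^T x||^2 / n."""
    x = x / np.linalg.norm(x)
    return np.linalg.norm(C.T @ x) ** 2 / C.shape[1]

rng = np.random.default_rng(1)
# Synthetic cloud in R^3 that is stretched along the first coordinate axis.
A = rng.standard_normal((3, 200)) * np.array([[5.0], [1.0], [0.5]])
mu, u1, C = principal_trend_line(A)

x = rng.standard_normal(3)                    # an arbitrary competing direction
print(deviation(C, u1) <= deviation(C, x))
print(directional_variance(C, u1) >= directional_variance(C, x))
```

For any unit direction the two quantities add up to a constant, `deviation + n * directional_variance = ||C||_F^2`, which is why minimizing one is the same as maximizing the other.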

2.1.3 Principal Partitions 

The first step in making principal partitions is to divide the data into two disjoint sets by slicing it
with an affine hyperplane $P = u_1^{\perp} + \mu$ that is orthogonal to the principal trend line $L = \alpha u_1 + \mu$.
As depicted in Figure 2, it is natural to put the points that are on one side of P (say the points
"in front" of P, as depicted in the figure) into one group and to put points on the other side (the
points "behind" P) into another group.

Figure 2: Data Cloud Partitioned by Affine Hyperplane



The distinction between "front" and "back" is simply made by determining whether the
projection $\hat{a}_j = u_1 u_1^T(a_j - \mu) + \mu$ of a data point $a_j$ onto the principal trend line $L = \alpha u_1 + \mu$ lies to
one side of $\mu$ or the other. Since $\hat{a}_j - \mu = \alpha_j u_1$ for some $\alpha_j$, the sign of $\alpha_j$ determines the side of
P that $\hat{a}_j$ and $a_j$ are on. Since $\alpha_j = u_1^T(a_j - \mu)$ and since $a_j - \mu = c_j$ is the j-th column of the
centered matrix C, it follows that

\[ [\alpha_1, \alpha_2, \ldots, \alpha_n] = [u_1^T c_1, u_1^T c_2, \ldots, u_1^T c_n] = u_1^T C = \sigma_1 v_1^T \]

where $v_1$ is the principal right-hand singular vector of C that is associated with the largest singular
value, $\sigma_1$. The fortunate aspect of this observation is that once the SVD of C has been computed,
the vector $\sigma_1 v_1^T$ is immediately available. Furthermore, since only the signs of the components in
$u_1^T C$ are needed to determine on which side of P the respective columns in A lie, and since $\sigma_1 > 0$,
it is evident that the principal partition is determined simply by inspecting the signs of the entries
in $v_1$.

Definition 3 (The Principal Partition). The principal partition of the column data in A is
determined by the signs of the entries in the principal right-hand singular vector, $v_1$, of the centered
matrix C. Columns in A corresponding to positive signs in $v_1$ are placed in one cluster while
columns corresponding to negative signs are placed in another cluster. A column associated with a
zero entry in $v_1$ may be arbitrarily assigned to either cluster.
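A minimal sketch of the principal partition, assuming NumPy and a small hand-made data matrix (the helper name `principal_partition` and the example data are illustrative assumptions). Zero entries of $v_1$ are assigned to the nonnegative side here, which is one of the arbitrary choices the definition allows.

```python
import numpy as np

def principal_partition(A):
    """Split the columns of A (m x n) by the signs of the entries of v1."""
    C = A - A.mean(axis=1, keepdims=True)     # centered matrix
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    v1 = Vt[0, :]                             # principal right-hand singular vector
    # Zero entries may go to either cluster; here they join the nonnegative side.
    return np.flatnonzero(v1 >= 0), np.flatnonzero(v1 < 0)

# Six points in the plane forming two groups along the horizontal axis.
A = np.array([[-3.0, -2.5, -2.0, 2.0, 2.5, 3.0],
              [ 0.1, -0.1,  0.0, 0.1, 0.0, -0.1]])
side_a, side_b = principal_partition(A)
print(side_a, side_b)
```

The overall sign of a singular vector is not unique, so which group is reported first can differ between SVD implementations; the two groups themselves do not change.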

3 Principal Direction Divisive Partitioning 

Once the principal partition of the data has been made, there are several options for making 
further partitions. One such option is the principal direction divisive partitioning (PDDP) scheme 
proposed by Daniel Boley [2]. This algorithm suggests we make the principal partition and then 
examine both clusters to determine which has the maximal variance, or scatter. This cluster of 
maximal variance is then repartitioned across its own principal trend line, separating the data into 
a total of three disjoint sets, and the process continues by repartitioning the cluster of maximal
variance each time, producing any desired number of disjoint (hard) clusters. At each step of 
PDDP the projected data is split by a principal partition. 
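The PDDP scheme described above can be sketched as follows. This is an illustrative reading rather than Boley's reference implementation: the helper names are hypothetical, and the cluster to split is chosen using the variance $\|C\|_F^2/n$ as defined in Section 2.1, which is an assumption about how "maximal variance, or scatter" is measured.

```python
import numpy as np

def variance(A):
    """Var[A] = ||A - mu e^T||_F^2 / n for column data A."""
    C = A - A.mean(axis=1, keepdims=True)
    return np.linalg.norm(C, "fro") ** 2 / A.shape[1]

def sign_split(A):
    """Principal partition: local column indices grouped by the sign of v1."""
    C = A - A.mean(axis=1, keepdims=True)
    v1 = np.linalg.svd(C, full_matrices=False)[2][0, :]
    return np.flatnonzero(v1 >= 0), np.flatnonzero(v1 < 0)

def pddp(A, k):
    """Return a list of index arrays (columns of A), one per cluster."""
    clusters = [np.arange(A.shape[1])]
    while len(clusters) < k:
        # Only clusters with at least two columns can be split further.
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: variance(A[:, clusters[i]]))
        idx = clusters.pop(worst)
        a, b = sign_split(A[:, idx])
        clusters += [idx[a], idx[b]]          # map local indices back to columns of A
    return clusters
```

Calling `pddp(A, 3)` on a matrix whose columns form three well separated, equally sized groups returns three index arrays, one per group.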

4 Principal Direction Gap Partitioning 

Principal Direction Gap Partitioning (PDGP) is our adaptation of PDDP which takes into account
natural gaps which identify clusters in the data. We will motivate our algorithm with a
discussion of some geometrical scenarios in which PDDP breaks down.

4.1 Motivation 

The technique of clustering the column data in A by means of principal partitions is appealing
because it is easily implemented by simply inspecting the signs of the principal right-hand singular
vector of C. However, superior results can often be obtained if we are willing to compromise
this simplicity slightly to look for natural gaps in the data. For example, suppose that the data
naturally clusters into three distinct groups as shown in Figure 3.



Figure 3: Three Data Clouds along Principal Direction



If this data is partitioned by P using the signs of $v_1$, then the middle cluster is unnaturally
sliced into two pieces. It seems more reasonable to shift P and partition the data with an affine
hyperplane $(\alpha u_1 + \mu) + u_1^{\perp}$, as shown in Figure 4, where $\alpha$ is chosen to put the shifted hyperplane
into the largest gap in the data.

Gaps between clusters are easily detected by projecting the columns of A onto the principal 
trend line and measuring the gaps between adjacent points. 

As admitted in [2], the choice of splitting the projected data at zero is somewhat arbitrary
because it is based on the assumption that the mean of the data will naturally fall in between two
well separated clusters. It is easy to see when this assumption might fail, for example in the case of
unbalanced cluster sizes. Figures 5 and 6 show two real world examples in which this assumption
fails. In these two graphs the entries in the principal right-hand singular vector, $v_1$, are plotted
in increasing order. The black line depicts the split that PDDP will make. The PDGP algorithm
splits at the gap that naturally clusters the data.

Figure 4: Partition by a Shifted Affine Hyperplane

Sometimes the divisions made by PDDP and PDGP coincide. This indicates a situation in which
the assumption that two clusters are separated by the mean holds true; see Figure 7.

4.2 Description of the Algorithm 

The PDGP algorithm is identical to the PDDP algorithm aside from where the data is split at each
step. After the data is projected onto the principal trend line, PDDP splits the data at zero while
PDGP splits the data at the largest gap between the points. To further clarify this, we propose
the following definition.

Definition 4 (Gap Partition). Sort the components of the first right-hand singular vector, $v = v_1$,
in ascending order and label the sorted vector s. Let p be the permutation required for this sort,
i.e. $s_1 = v_{p_1} \le s_2 = v_{p_2} \le \cdots \le s_n = v_{p_n}$. If the largest gap between consecutive entries of s,
$s_{k+1} - s_k$, occurs at index k, then the gap partition of v, which provides the indices of the column
vectors that should be placed in the respective clusters, is:

\[ \pi_1 = [p_1, \ldots, p_k], \qquad \pi_2 = [p_{k+1}, \ldots, p_n]. \]
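The gap partition can be sketched in a few lines of NumPy. The function name `gap_partition` and the example vector are assumptions made for illustration; the vector is chosen so that its entries sum to zero, as the entries of $v_1$ do for a centered matrix, and so that the largest gap does not sit at zero.

```python
import numpy as np

def gap_partition(v):
    """Return (pi1, pi2): indices of v below and above the largest gap."""
    p = np.argsort(v, kind="stable")     # permutation p with s = v[p] ascending
    s = v[p]
    gaps = np.diff(s)                    # gaps[i] = s[i+1] - s[i]
    k = int(np.argmax(gaps)) + 1         # number of entries below the largest gap
    return p[:k], p[k:]

# Unbalanced example: six entries in one group, two in the other.
v = np.array([0.45, -0.25, 0.05, -0.22, 0.50, -0.20, -0.18, -0.15])
pi1, pi2 = gap_partition(v)
print(pi1, pi2)
```

In this example a split at zero would place the entry 0.05 with the two large positive entries, whereas the gap partition keeps it with the larger group, because the widest gap lies between 0.05 and 0.45.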

4.2.1 Fringe Effect and Fringe Tolerance 

One obstacle in the implementation of this algorithm is something we call the fringe effect. This
occurs when the largest gap in a vector $v_1$ lies very close to one of the ends of s. These "fringe gaps", if taken into
account, would separate the data into severely unbalanced clusters, one containing almost all of
the data and the second containing a mere few. Because the fringe points often represent outliers
or noise, this issue must be addressed.

Figure 5: Principal Partition of projected data points



For an example, see Figure 5. Notice the outlying data points on either end that create
large gaps. The human eye is likely to find 3 or 5 clusters in this image, depending on whether you
decide the first and last points belong to their own cluster or not. Since one of our goals is to find
relatively balanced clusters, the ideal split appears to be between 18 and 19 or between 25 and 26.
The line shown is how PDDP would split the data into clusters.

To counteract this phenomenon we created a "fringe tolerance", t, to control the balance of 
cluster sizes. We ignore a percentage of the projected data points at each end of the graph. For 
our experiments we have ignored a total of 20 percent (t = .2), or 10 percent from each end. In 
choosing this particular value, we are insisting that the algorithm not separate the number of data 
points in a cluster into any ratio larger than 9:1. The fringe tolerance can be changed as the data 
set changes. Intuitively for smaller data sets the fringe tolerance should be higher, and for larger 
data sets it should be smaller, especially for a large number of clusters, as a lower percent still 
encompasses many data points. 




PDGP Algorithm

1. Input: Data matrix A, desired number of clusters k, fringe tolerance t.

2. Determine the cluster of maximal variance and use its data vectors to form a matrix M.
To begin with, M is the entire data collection A.

3. Calculate the SVD of the centered matrix $C = M - \mu e^T$.

4. Calculate the gap partition of the first right-hand singular vector, ignoring the proportion
t/2 of the data at the beginning and at the end of the sorted vector s.

5. Repeat steps 2-4 until k clusters are created.

Figure 6: Principal Partition of projected data points
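The steps of the PDGP algorithm can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, and the fringe tolerance is interpreted as requiring each side of a split to keep at least `ceil(t/2 * n)` of the n projected points, which is one possible way to "ignore" a proportion t/2 at each end.

```python
import numpy as np

def variance(A):
    """Var[A] = ||A - mu e^T||_F^2 / n for column data A."""
    C = A - A.mean(axis=1, keepdims=True)
    return np.linalg.norm(C, "fro") ** 2 / A.shape[1]

def gap_split(A, t):
    """Split the columns of A at the largest admissible gap in v1."""
    n = A.shape[1]
    C = A - A.mean(axis=1, keepdims=True)
    v1 = np.linalg.svd(C, full_matrices=False)[2][0, :]
    p = np.argsort(v1, kind="stable")
    gaps = np.diff(v1[p])                     # gaps[i] separates i+1 points from n-i-1
    # Assumed reading of the fringe tolerance: each side keeps >= ceil(t/2 * n) points.
    lo = min(max(1, int(np.ceil(t / 2 * n))), n // 2)
    k = lo + int(np.argmax(gaps[lo - 1 : n - lo]))
    return p[:k], p[k:]

def pdgp(A, k, t=0.2):
    """Return a list of index arrays (columns of A), one per cluster."""
    clusters = [np.arange(A.shape[1])]
    while len(clusters) < k:
        # Step 2: pick the cluster of maximal variance among those that can be split.
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: variance(A[:, clusters[i]]))
        idx = clusters.pop(worst)
        # Steps 3-4: SVD of the centered matrix and gap partition of v1.
        a, b = gap_split(A[:, idx], t)
        clusters += [idx[a], idx[b]]
    return clusters
```

With `t=0.2` no split in this sketch is more unbalanced than roughly 9:1, in line with the value used in the experiments; setting `t=0` reduces `gap_split` to the plain gap partition of Definition 4.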



5 Document Clustering 

The main focus of our research has been in the realm of document clustering. Data taken from
a group of text documents is traditionally stored in an m × n term-by-document matrix where
the m rows correspond to the various terms extracted from the documents and the n columns
correspond to individual documents. The terms extracted from the document list are filtered
through a "stoplist" of common words to remove terms like "is," "the," and "however." The
(i, j)-entry $a_{ij}$ of this matrix is the number of times term i occurs in document j.



5.0.2 Term Weighting 

In the field of text mining, the raw term frequencies in the term-document matrix, A, are generally
weighted in an effort to downplay the effects of commonly used words and bolster the effect of rare
but semantically important words. In another approach the columns of A can be normalized so
that lengthy documents do not overshadow their terse counterparts. In this paper we use TFIDF
(term frequency - inverse document frequency) weighting or normalization scaling to pre-process
our text data prior to clustering. For comparison to the PDDP algorithm we use the variant of
TFIDF given in [5] as used in [2]. This variant of TFIDF is as follows:

Figure 7: PDDP and PDGP divisions coincide



Definition 5 (Term Frequency - Inverse Document Frequency (TFIDF) weighting). The TFIDF
weighting for a matrix A is the matrix B with entries

\[ B_{ij} = \frac{1}{2}\left(1 + \frac{a_{ij}}{\max_k (a_{kj})}\right) \log\left(\frac{n}{\text{number of documents containing term } i}\right). \]

For normalization scaling, each document is normalized to have unit Euclidean length:

\[ B_{ij} = \frac{a_{ij}}{\sqrt{\sum_k a_{kj}^2}}. \]

This can alternatively be thought of as normalizing each document vector.

It should be noted, however, that in later sections of this paper we discuss the use of scientific
data as well as the use of textual data. The rationale for term weighting in scientific data no
longer applies because there are no documents of different lengths to contend with. Although
normalization is commonly used in scientific data, it is unnecessary when the values of a variable
are physically constrained to stay in a reasonable range.
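Normalization scaling is simple to express in code. The sketch below assumes a dense NumPy term-by-document matrix with documents as columns; the function name `norm_scaling` and the toy counts are illustrative assumptions, and real term-document matrices would normally be stored in a sparse format.

```python
import numpy as np

def norm_scaling(A):
    """Scale each column (document) of A to unit Euclidean length."""
    norms = np.linalg.norm(A, axis=0, keepdims=True)
    norms[norms == 0] = 1.0              # leave empty documents as zero columns
    return A / norms

# Toy term-by-document matrix: 3 terms (rows), 3 documents (columns).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 4.0, 0.0]])
B = norm_scaling(A)
print(np.linalg.norm(B, axis=0))         # every nonempty document now has length 1
```

After this scaling a long document and a short document with the same relative term usage are represented by the same vector, which is the stated purpose of normalization.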



5.1 Cluster Evaluation 

Cluster evaluation or validation is an important aspect of any cluster-related research. Since most
existing clustering algorithms will determine clusters in data whether or not they exist naturally,
it is important to have some way to evaluate the accuracy of clustering results. Cluster evaluation
measures are typically broken into two categories, internal (or unsupervised) and external (or
supervised). Internal measures use no outside information, such as class or category labels, to
determine the validity of the clustering. Internal measures are typically measures of cluster cohesion
and separation. Cluster cohesion gives us an idea of how dense an individual cluster is, while cluster
separation tells us how distinct or separated the clusters are from each other. The most commonly
used internal measure is the silhouette coefficient, which combines both cohesion and separation.
Since internal measures do not tell us explicitly about the accuracy of our clustering results, we do
not use them in this paper [6].

Figure 8: "Fringe values"

External measures use information not included in the dataset (such as class labels) to determine
how well the algorithm clustered the data into its pre-determined clusters. External measures are
not useful in practice because there is no need to cluster data which is already categorically assigned,
but they give us a more accurate metric for comparing different clustering algorithms. The most
common external measure for cluster evaluation is entropy, which is described in detail at the end of
the paper. A smaller entropy value indicates a higher quality clustering. It is important to note
that we have used normalized entropy so that the values fall between 0 and 1. For this reason,
our entropy values for PDDP run on the same data sets as Boley's original experiments in [2] will
differ by the multiplicative constant $\log_2(k)$, where k is the actual number of clusters.
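For readers who want to experiment before reaching the detailed description, the following sketch computes a normalized entropy from a class-by-cluster count table. It assumes the standard definition (size-weighted average of the per-cluster class entropies, in bits) divided by $\log_2(k)$ with k the number of true classes; the exact formula used for the tables in this paper may differ in detail, and the function name is hypothetical.

```python
import numpy as np

def normalized_entropy(counts):
    """counts[i, j] = number of items of true class i placed in cluster j."""
    counts = np.asarray(counts, dtype=float)
    n_classes = counts.shape[0]              # assumed to be at least 2
    sizes = counts.sum(axis=0)               # cluster sizes
    total = sizes.sum()
    entropy = 0.0
    for j in range(counts.shape[1]):
        if sizes[j] == 0:
            continue                         # empty clusters contribute nothing
        p = counts[:, j] / sizes[j]          # class proportions inside cluster j
        p = p[p > 0]                         # treat 0 * log(0) as 0
        entropy += sizes[j] / total * -(p * np.log2(p)).sum()
    return entropy / np.log2(n_classes)

print(normalized_entropy([[5, 0], [0, 5]]))  # every cluster is pure
print(normalized_entropy([[5, 5], [5, 5]]))  # classes evenly mixed in each cluster
```

A clustering in which every cluster contains a single class scores 0, and one in which every cluster mixes the classes evenly scores 1, which matches the stated range of the normalized measure.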



6 Description of Data Sets and Experimental Results 

Experiments comparing PDDP and PDGP were performed on a series of data sets. We chose not 
only document data sets, but also scientific data to compare the clustering algorithms. 

6.1 J Document Sets 

This document set was used in the original paper on PDDP [2], and consists of 185 documents
taken from the world wide web. A stop list of common words was applied, and also a stemmer to
handle verb tenses, plurals, etc. The matrix obtained by counting the remaining words is called
J1. Further modifications were made, resulting in J2-J11, as seen in Table 3 of [2].

In running the two algorithms, tables similar to Table 5 in [2] were created. It is important to
note that normalized entropy (see the section on entropy) was used so that the values would range
between 0 and 1. To produce tables similar to those in [2], simply scale these tables by a factor
of $\log_2(10)$, because it was predetermined that the data had 10 clusters. Smaller entropy values
indicate a better clustering.

Table 1: Normalized entropy values obtained by PDDP and PDGP on the J document sets using
norm scaling

             PDDP (norm scaling)        PDGP (norm scaling)
clusters       8       16      32         8       16      32
J1           0.372   0.208   0.154      0.388   0.191   0.147
J6           0.399   0.250   0.169      0.398   0.232   0.157
J4           0.459   0.331   0.213      0.508   0.319   0.215
J3           0.399   0.256   0.183      0.388   0.232   0.154
J7           0.408   0.270   0.182      0.393   0.220   0.158
J8           0.442   0.288   0.207      0.400   0.230   0.173
J2           0.510   0.337   0.229      0.449   0.302   0.214
J9           0.496   0.322   0.228      0.507   0.327   0.213
J10          0.507   0.351   0.257      0.523   0.381   0.225
J5           0.395   0.221   0.155      0.375   0.200   0.155
J11          0.443   0.315   0.202      0.510   0.343   0.217


Table 2: Normalized entropy values obtained by PDDP and PDGP on the J document sets using
TFIDF term weighting

             PDDP (TFIDF)               PDGP (TFIDF)
clusters       8       16      32         8       16      32
J1           0.440   0.318   0.214      0.551   0.435   0.344
J6           0.353   0.232   0.194      0.546   0.401   0.272
J4           0.496   0.356   0.281      0.481   0.379   0.292
J3           0.472   0.335   0.278      0.487   0.350   0.263
J7           0.384   0.275   0.217      0.536   0.401   0.274
J8           0.398   0.272   0.230      0.491   0.361   0.300
J2           0.469   0.343   0.242      0.512   0.416   0.312
J9           0.428   0.308   0.228      0.472   0.311   0.221
J10          0.571   0.374   0.296      0.485   0.349   0.275
J5           0.322   0.184   0.138      0.439   0.263   0.170
J11          0.506   0.358   0.277      0.467   0.299   0.228


It is apparent from the results in Table 1 that PDGP is competitive with PDDP when using
the norm scaling, and frequently provides a clustering with lower entropy. When TFIDF weighting
(Table 2) is used, PDDP performs slightly better than PDGP. However, it is experimentally evident
that norm scaling provides better clustering results overall when compared to TFIDF weighting,
and thus norm scaling should probably be used in favor of the TFIDF weighting for either of these
two clustering algorithms.

6.2 Abalone Data Set 

This data set was obtained from [1] and contains measurements of 4177 different abalone. There
were 8 characteristics measured: sex (male, female, infant), length, diameter, height, whole weight,
shucked weight, viscera weight, and shell weight. The sex variable was assigned to be 0 for a
male, 1 for a female, and 2 for an infant. These measurements were paired with ages, of which 28
different ages were determined. We omitted the age variable from the dataset and aimed to cluster
the organisms based upon this variable. However, there were several age groups containing only a
few abalone (specifically the older age groups), and thus the sizes of the clusters are expected to be
unbalanced. We used both PDDP and PDGP to cluster the data with various numbers of clusters.
No scaling or normalization was applied to the data set because the values of each variable are
expected to fall within a natural range.

Table 3: Normalized entropy values obtained by PDDP and PDGP on the Abalone scientific data

clusters sought    20      21      22      23      24      25      26      27      28
PDDP entropy     0.624   0.622   0.622   0.622   0.622   0.620   0.620   0.620   0.618
PDGP entropy     0.622   0.620   0.618   0.618   0.616   0.616   0.616   0.614   0.614


6.3 Iris Data Set 

This data set was obtained from [1] and contains information on 150 flowers. Each flower was
measured with four characteristics: sepal length, sepal width, petal length, and petal width. Of
these flowers there are 3 different species, so the overall data was stored as a 4 x 150 matrix. Again,
no scaling or normalization was used. PDDP and PDGP were set to run to find 3 clusters.



Table 4: Normalized entropy values obtained by PDDP and PDGP on the Iris data

                              PDDP                                PDGP
Iris Species       Cluster 1  Cluster 2  Cluster 3     Cluster 1  Cluster 2  Cluster 3
# of Setosa           50          0          0            50          0          0
# of Versicolour       9         38          3             0         50          0
# of Virginica         0         14         36             0         34         16
Total Entropy                  0.404                               0.347


6.4 Reuters-10 Document data 

This collection of documents, downloaded from [1], is a subset of the Reuters collection consisting
of 20 documents pulled from each of 10 keyword searches for a total of 200 documents. The files
were read out by 3 Indian speakers and an Automatic Speech Recognition (ASR) system was used
to generate the transcripts. This dataset was collected to study the effect of speech recognition
noise on text mining algorithms. Normalization scaling was used. In this noisy data set, PDDP
provides a slightly lower entropy than PDGP.

Table 5: Normalized entropy values obtained by PDDP and PDGP on Reuters-10

          Entropy
PDDP      0.6021
PDGP      0.6385

6.5 Wisconsin Breast Cancer Data Set (Original)

This data set, also downloaded from [1], consists of 699 observations of individuals with abnormal
breast tissue growth obtained from the University of Wisconsin Hospitals, Madison by Dr. William
H. Wolberg [4]. Each observation consists of 9 measurements such as clump thickness and uniformity of
cell size and shape. A variable indicating whether a growth was benign or malignant was included
in the data, so we removed it and clustered the observations into two groups, hoping to predict this
variable through clustering. There were 16 missing values in the data, which were set to 0. No
normalization or scaling was applied. Both PDDP and PDGP performed well on this task, though
PDGP was slightly superior.

Table 6: Normalized entropy values obtained by PDDP and PDGP on Wisconsin Breast Cancer
Data

          Entropy
PDDP      0.0052
PDGP      0.0029



7 Conclusion 

PDDP and PDGP are both SVD-based clustering algorithms which seek to use the principal trends in
a given data set to separate related observations/documents into clusters. Where PDDP arbitrarily
makes this split along the principal direction at the mean, PDGP looks for natural gaps in the
data. We sought to elucidate the geometrical interpretation of the singular vectors, and argue
that although PDDP and PDGP often perform comparably, gap partitioning makes more sense
intuitively. We have explored many variants of these clustering algorithms in our research, and
have suggested some simple implementations for future research.

One of the issues at large with the PDGP algorithm is the fringe effect. The fringe tolerance t
effectively controls the balance of the cluster sizes, but it arbitrarily causes the splitting algorithm
to ignore a certain percentage of the data projections. There may be other applications that will
allow for the inclusion of this information, for instance outlier identification. Especially in cases
where documents are extracted from the world wide web, it is likely that some noisy documents
which have no connection to the other documents will be extracted. However, just because a
projected point looks like an outlier along the principal direction does not mean that it is truly an
outlier in the context of the whole data set. Looking along secondary directions may provide more
information to this effect.



8 Acknowledgements 

We would like to thank the National Science Foundation (NSF) for funding our REU program and
making our work possible. We are also grateful for the UC Irvine Machine Learning Repository of
data sets. We downloaded several of the above mentioned data sets from their website, including the
Wisconsin Breast Cancer Database, which is described in detail at
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names.
Thanks to Dimitrios Zeimpekis and Efstratios Gallopoulos, the creators of the MATLAB Text to Matrix Generator
(TMG), which was used to parse many of the document sets used herein.

References 

[1] A. Asuncion and D. Newman, UCI machine learning repository, 2007. 

[2] D. Boley, Principal direction divisive partitioning, Data Mining and Knowledge Discovery, 2 
(1998), pp. 325-344. 

[3] I. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer, 2nd ed.,
2002.

[4] O. Mangasarian and W. Wolberg, Cancer diagnosis via linear programming, SIAM News, 
23 (1990). 

[5] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval,
Information Processing and Management, 24 (1988), pp. 513-523.

[6] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 
2005, ch. 8. 






